(57) Abstract



A teleconference system (200) is disclosed for digitally recording and playing a conference telephone call that includes a plurality of 
intervals. The teleconference system includes a skim server (55) that detects a first set of the plurality of intervals and a conference bridge 
(100) that detects a second set of the plurality of intervals during the conference call. An interval database server (65) generates labeled 
interval data for all detected intervals and stores the labeled interval data in a database. The labeled interval data includes an interval data 
element that defines each interval. After the conference call is recorded, the labeled interval data can be searched and retrieved based 
on assorted criteria. Portions of the recorded conference call associated with the retrieved labeled interval data can also be retrieved and 
played back. This facilitates easy retrieval and playback of desired portions of a recorded conference call. Further, during playback of the
conference call, a user interface (85) is generated. The user interface displays the stored labeled interval data. A user can easily select or
skip to desired portions of the conference call by selecting portions of the user interface. 
METHOD AND APPARATUS FOR STORING AND RETRIEVING
LABELED INTERVAL DATA FOR MULTIMEDIA RECORDINGS 



FIELD OF THE INVENTION

The present invention is directed to storage and retrieval of 
multimedia data. More particularly, the invention is directed to storage 
and retrieval of labeled interval data in a database. 

BACKGROUND OF THE INVENTION

Unlike written communications, speech communications are rarely
recorded, let alone stored, even though storage of digital speech may be
readily achieved. It is presently feasible to store gigabytes and even
terabytes of digitally recorded speech or other types of multimedia
information (e.g., video). Other than for archival purposes, there is no
practical reason for storing such data without having a mechanism by
which a user can identify and retrieve only those portions of the stored
data that may be of interest.

The difficulty inherent in searching and retrieving digital speech
records stored in a database stems from the traditional approaches to
querying a database to locate particular records. Most database queries are
logical queries based upon the presence or absence of specified
characteristics in the records being searched. Boolean logic and fuzzy
logic have been used to increase the utility of database queries, but these
techniques merely extend the fundamental basis of most typical database
queries: whether one or more terms, indices, or other identifying
characteristics are present (or absent) in the records being searched.

Digital speech records that have not been converted into text by
speech-to-text conversion or transcription, or otherwise parsed, cannot be
located and/or identified using traditional database query techniques, as it
is not practical to determine whether a word (or phrase) appears in a
selected portion of recorded speech. Therefore, review of non-transcribed
digital speech records is frequently limited to listening to the digitally
recorded speech until the item or items of interest are heard.
Unfortunately, this frequently requires listening to a considerable amount
of extraneous or irrelevant speech, which can be extremely time-consuming
without providing any significant elucidation. Moreover, digital speech
records frequently contain lengthy pauses and, if the digital speech record
is between more than two speakers, it is frequently difficult, if not
impossible, to identify the speakers, further exacerbating the problem of
identifying a specific segment in recorded digital speech.

Even when a digital speech record is divided into separate digital
recordings, and each recording is individually accessible and identified, the
digitally recorded data is of limited use. For example, if ten conference
calls were recorded in a digital storage medium, a user might be able to
locate a particular conference call on a particular date, if the user were
fortunate enough to know that the information he or she sought was in that
specific conference call. Even so, the user would still have to listen to the
entire recording of the conference call. For a user seeking to identify a
specific comment made by a specific participant to the conference call, it is
extremely inefficient for the user to have to listen to the entire conference
call. Moreover, if the user does not know the specific date and time of the
conference call in which the person spoke, the user might have to listen to
several conference call recordings before finding the desired information.
Clearly, as soon as more than a minimal number of recordings are
stored, it becomes impractical for a user to locate desired information
merely by listening to the conference call recordings.

Based on the foregoing, there is a need for a method and apparatus 
for readily identifying, locating, and retrieving stored digital speech and 
other digital multimedia records. 

SUMMARY OF THE INVENTION 

One embodiment of the present invention is a teleconference system
for digitally recording and playing a conference telephone call that includes
a plurality of intervals. The teleconference system includes a skim server
that detects a first set of the plurality of intervals and a conference bridge
that detects a second set of the plurality of intervals during the conference
call. An interval database server generates labeled interval data for all
detected intervals and stores the labeled interval data in a database. The
labeled interval data includes an interval data element that defines each
interval. After the conference call is recorded, the labeled interval data can
be searched and retrieved based on assorted criteria. Portions of the
recorded conference call associated with the retrieved labeled interval data
can also be retrieved and played back. This facilitates easy retrieval and
playback of desired portions of a recorded conference call.

Further, during playback of the conference call, a user interface is
generated. The user interface displays the stored labeled interval data. A
user can easily select or skip to desired portions of the conference call by
selecting portions of the user interface.

BRIEF DESCRIPTION OF THE DRAWINGS 
Fig. 1 illustrates a teleconference system in accordance with one
embodiment of the present invention.

Fig. 2 illustrates the format of an interval data element that forms
the labeled interval data associated with a recorded conference.

Fig. 3 illustrates a conference playback document in accordance
with one embodiment of the present invention.

Fig. 4 illustrates in detail how overlapping intervals are displayed. 



DETAILED DESCRIPTION 

In one embodiment of the present invention, intervals within
recorded digital speech or other multimedia data are specifically identified
and labeled. The labeled interval data provides a mechanism by which a
user can specifically identify an interval within digitally recorded
multimedia and, having identified that interval, retrieve it and other
intervals sharing desired characteristics.

Fig. 1 illustrates a teleconference system in accordance with one
embodiment of the present invention. Teleconference system 200 records
and stores a teleconference call and associated labeled interval data.
Teleconference system 200 further allows a recorded teleconference to be
played back using the stored labeled interval data.
The main components of teleconference system 200 are a
conference recorder 110, a skim server 55, an interval database ("IDB")
server 65, and a Java user interface 85.

In teleconference system 200, a plurality of telephones 31, 32, and
33 are interconnected through the public switched telephone network
("PSTN") 40. One or more individuals may participate in a teleconference
through each telephone 31-33. The participants may be identified by the
telephone they are calling from or, alternatively, by voice recognition or
other forms of identification during the teleconference.

A teleconference may be initiated by a conference host accessing a
WebRoom interface on a WebRooms server 50. A WebRoom interface
provides a mechanism by which participants may be actively added to
and/or deleted from a teleconference. In one embodiment, the WebRoom
interface for all teleconference participants is implemented as a Common
Gateway Interface ("CGI") program 60 on a HyperText Transfer
Protocol Web Server ("Httpd") 70 that provides interactive control of the
teleconference through HyperText Markup Language ("HTML")
documents. The HTML documents are accessible as conference pages 80
through a Web browser 90 such as Netscape® Navigator or Internet
Explorer®.

At record time, the conference host uses WebRooms server 50 to
dial a conference scribe. The conference scribe acts as an additional
participant to the teleconference. At the same time, conference recorder
110 tells IDB server 65 to create a new collection point, referred to as a
"depot", for storing all data related to this particular recording, and it
tells skim server 55 to begin recording an audio file using, for example, a
Dialogic board 57 from Dialogic Corp., or its equivalent. A depot in
teleconference system 200 can be a structured query language ("SQL")
database 35 coupled to an Open DataBase Connectivity ("ODBC")
interface 36. While the conference is running, conference bridge 100
detects call control events (e.g., which participant is talking, new
participants being added, etc.) and sends these events through WebRooms
server 50 and conference recorder 110 into the new depot (i.e., SQL
database 35). Meanwhile, skim server 55 detects pauses in speech and
adds these events to the depot as well. The events detected by both
conference bridge 100 and skim server 55 are referred to as "intervals".

When playing back a recorded conference on teleconference system
200, the user brings up a Java user interface 85 to select a recording
accessed via IDB server 65. The user interface 85 retrieves labeled
interval data for the recording and uses it to display a visual time-line
of events. The user enters a phone number that is passed to skim server
55 so it can call the user's telephone for conference playback through
Dialogic board 57. As the audio plays on the user's phone, Java user
interface 85 continuously updates the graphical display and controls how
the recording is played using skim server 55. In one embodiment of the
present invention, all clients, such as Java user interface 85 and
conference recorder 110, communicate with skim server 55 and IDB server
65 through a CORBA application programming interface. CORBA was chosen
because it allows a simple interface between programs written in different
languages running on different platforms. Both servers 50 and 55 and
conference recorder 110 are written in C++ and run on Sun Solaris
platforms in one embodiment of the present invention.

Skim server 55 performs the following functions:

1. Records audio from telephone line to file.
2. Detects speech events while recording and posts them to the
   database.
3. Plays from file to telephone line
   - from any point in recording
   - at variable speeds
   - with pauses removed or not.

In one embodiment, skim server 55 is based on the same type of
hardware as standard voice mail servers, and it performs many of the same
functions. Skim server 55 differs from a more traditional voice mail server
in that it processes speech events and posts them to IDB server 65, and in
that it provides fine control over what parts of the audio file are played
and what parts are skipped.
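
The playback control described above can be sketched as follows. This is only an illustrative sketch, not the skim server's actual implementation: the function names, the representation of pauses as (start, end) millisecond pairs, and the speed handling are all assumptions introduced for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_ms, end_ms), offsets from the recording start


def play_segments(start_ms: int, end_ms: int, pauses: List[Span],
                  skip_pauses: bool = True) -> List[Span]:
    """Return the spans to play between start_ms and end_ms.

    With skip_pauses=True, every detected pause is cut out of the range.
    """
    if not skip_pauses:
        return [(start_ms, end_ms)]
    segments: List[Span] = []
    cursor = start_ms
    for p_start, p_end in sorted(pauses):
        if p_end <= cursor:
            continue  # pause lies entirely before the current position
        if p_start >= end_ms:
            break  # pause lies after the requested range
        if p_start > cursor:
            segments.append((cursor, p_start))
        cursor = max(cursor, p_end)
    if cursor < end_ms:
        segments.append((cursor, end_ms))
    return segments


def playback_duration_ms(segments: List[Span], speed: float = 1.0) -> float:
    """Listening time needed to play the segments at the given speed."""
    return sum(end - start for start, end in segments) / speed
```

Playing from an arbitrary point corresponds to choosing start_ms, and variable-speed play only changes the listening time, not which segments are selected.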

One function of IDB server 65 is to store and retrieve labeled
interval data associated with a recorded conference. This is data that
describes properties about specific intervals within the speech, such as who
is talking, pauses in speech, telephone call control data, etc. This can be
further extended to applications that require intervals that mark video scene
changes, or relate automatic speech recognition output to a recording. The
labeled interval data can be created, stored, and retrieved by a number of
different applications. Some are automatically derived from raw speech
data, some are side effects of user activity, and others may be entered
manually at record time or at playtime.

Fig. 2 illustrates the format of an interval data element 130 that
forms the labeled interval data associated with a recorded conference.
Every interval during the recorded conference will be associated with an
interval data element 130. In one embodiment, each interval data element
130 includes the following:
    Recording ID or Depot 122: Refers to the recording that is
    associated with the interval and the collection point where
    the recording is stored.

    Start time 123: Applications need both absolute time and
    time relative to recording start time. Relative time is more
    compact, and it is easy to convert to absolute as long as an
    absolute start time is stored with the recording.

    Duration or end time 124.

    Type: A code to identify the meaning of this interval. Is it
    a pause in speech, a scene change, etc.?

    Type-specific data values 126: Depending on the type, this
    data could be a string of text, a number, a URL, etc.

It must be possible to store, retrieve, and manipulate more than one
labeled interval data element at a time. Some applications will deal with
large collections of intervals that share everything except start time and end
time (e.g., all times when a specific person was speaking).
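
As an illustration of the element format listed above, a record of this kind could be modeled as shown below. This is a minimal sketch under assumptions: the class name, field names, and the choice of milliseconds are hypothetical and are not taken from Fig. 2.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Union


@dataclass
class IntervalDataElement:
    recording_id: str            # recording ID or depot (122)
    start_ms: int                # start time (123), relative to recording start
    duration_ms: Optional[int]   # duration (124); None if not yet known
    type_code: str               # meaning of the interval, e.g. "pause"
    value: Union[str, float, None] = None  # type-specific data value (126)

    @property
    def end_ms(self) -> Optional[int]:
        """End time relative to recording start, if the duration is known."""
        if self.duration_ms is None:
            return None
        return self.start_ms + self.duration_ms

    def absolute_start(self, recording_start: datetime) -> datetime:
        """Convert the relative start time to an absolute time."""
        return recording_start + timedelta(milliseconds=self.start_ms)
```

Storing only the relative offset keeps each element compact, and the absolute time is recovered from the single absolute start time stored with the recording.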

Applications must be able to store interval data in the database at
any time: before recording has begun, during recording, and after. For
example, for a teleconference it may be necessary to record caller-id and
ringing events before the call, record who is speaking during the call, and
make annotations about the call afterwards. Some applications need to
display incomplete interval data while a recording is in progress (e.g.,
catch up to live conference), so it should be possible to post an interval that
has started but not ended yet, and post the end time later. It should also be
possible to adjust interval data, for example to realign them with other
data.
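
The requirement to post an interval before its end is known, and to adjust it later, can be sketched with a small in-memory stand-in for the interval database. The class and method names here are hypothetical and do not reflect the actual interface of IDB server 65.

```python
from typing import Dict, List, Optional


class IntervalStore:
    """Hypothetical in-memory stand-in for the interval database."""

    def __init__(self) -> None:
        self._rows: Dict[int, dict] = {}
        self._next_id = 1

    def post(self, recording_id: str, type_code: str, start_ms: int,
             end_ms: Optional[int] = None, value: Optional[str] = None) -> int:
        """Post an interval; end_ms may be omitted if it has not ended yet."""
        interval_id = self._next_id
        self._next_id += 1
        self._rows[interval_id] = {
            "recording_id": recording_id, "type": type_code,
            "start_ms": start_ms, "end_ms": end_ms, "value": value,
        }
        return interval_id

    def close(self, interval_id: int, end_ms: int) -> None:
        """Supply the end time of an interval that was posted while open."""
        row = self._rows[interval_id]
        if end_ms < row["start_ms"]:
            raise ValueError("end time precedes start time")
        row["end_ms"] = end_ms

    def shift(self, interval_id: int, delta_ms: int) -> None:
        """Adjust an interval, e.g. to realign it with other data."""
        row = self._rows[interval_id]
        row["start_ms"] += delta_ms
        if row["end_ms"] is not None:
            row["end_ms"] += delta_ms

    def open_intervals(self, recording_id: str) -> List[int]:
        """IDs of intervals that have started but not yet ended."""
        return [i for i, r in self._rows.items()
                if r["recording_id"] == recording_id and r["end_ms"] is None]

    def get(self, interval_id: int) -> dict:
        return dict(self._rows[interval_id])
```

A client following a live conference could display the open intervals as still in progress and redraw them once the end time is posted.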



WO 99/17235

PCT/US98/20446

All applications that post events to IDB server 65 must specify
precise millisecond offsets for start and end times of each interval. All
offsets are from an absolute start-time for the recording. Posting intervals
from different machines in real-time requires that all clients posting
events have synchronized clocks, so standard network time protocol
("NTP") software is run on all of these machines.

Browse, search, and playback applications need to query and 
display subsets of interval data. Examples of queries that can be supported 
by the present invention include: 

• All interval data for a specific recording, sorted by time and 
type. 

• All intervals of a specific type with specific values, or 
values within a particular range. 

• Intervals within an absolute or relative time range. 
• Intervals of a specific duration.
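
Since a depot can be an SQL database, the example queries above might look roughly like the following. This is a sketch only: the single-table layout, column names, and sample rows are assumptions made for illustration, and SQLite stands in for whatever SQL database an embodiment uses.

```python
import sqlite3

# Hypothetical single-table layout for a depot; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE intervals ("
    " recording_id TEXT, type TEXT,"
    " start_ms INTEGER, end_ms INTEGER, value TEXT)"
)
conn.executemany(
    "INSERT INTO intervals VALUES (?, ?, ?, ?, ?)",
    [
        ("rec-1", "talking", 0, 4000, "A"),
        ("rec-1", "pause", 4000, 5500, None),
        ("rec-1", "talking", 5500, 9000, "B"),
        ("rec-2", "talking", 0, 2000, "A"),
    ],
)


def all_for_recording(recording_id):
    """All interval data for one recording, sorted by time and type."""
    return conn.execute(
        "SELECT type, start_ms, end_ms, value FROM intervals"
        " WHERE recording_id = ? ORDER BY start_ms, type",
        (recording_id,),
    ).fetchall()


def by_type_and_value(type_code, value):
    """All intervals of a specific type with a specific value."""
    return conn.execute(
        "SELECT recording_id, start_ms, end_ms FROM intervals"
        " WHERE type = ? AND value = ? ORDER BY recording_id, start_ms",
        (type_code, value),
    ).fetchall()


def in_time_range(recording_id, lo_ms, hi_ms):
    """Intervals that overlap the relative time range [lo_ms, hi_ms)."""
    return conn.execute(
        "SELECT type, start_ms, end_ms FROM intervals"
        " WHERE recording_id = ? AND start_ms < ? AND end_ms > ?"
        " ORDER BY start_ms",
        (recording_id, hi_ms, lo_ms),
    ).fetchall()


def by_min_duration(min_ms):
    """Intervals lasting at least min_ms milliseconds."""
    return conn.execute(
        "SELECT recording_id, start_ms, end_ms FROM intervals"
        " WHERE end_ms - start_ms >= ? ORDER BY recording_id, start_ms",
        (min_ms,),
    ).fetchall()
```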

The present invention provides for logical/set operations. For
example, assume a user wants to see and/or hear only the parts of a
recording when person A or person B was talking, and wants to leave all
the pauses out. This can be expressed by making three queries: intervals
when A was speaking (set A), intervals when B was speaking (set B), and
pause intervals (set P). The desired set can be expressed as "A union B
less P", or if these sets are thought of as long bit masks, then they can be
described as logical operations: (A | B) & (~P).
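
The "A union B less P" operation can be worked through on lists of (start, end) millisecond spans. The sketch below is one possible way to compute it; the function names and the span representation are assumptions, not the method used by the interval database server.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_ms, end_ms)


def union(a: List[Span], b: List[Span]) -> List[Span]:
    """Merge two interval sets into sorted, non-overlapping spans."""
    merged: List[Span] = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def difference(a: List[Span], p: List[Span]) -> List[Span]:
    """Remove every part of p from a ("a less p")."""
    result: List[Span] = []
    cuts = union(p, [])  # normalize p to sorted, non-overlapping spans
    for start, end in union(a, []):
        cursor = start
        for c_start, c_end in cuts:
            if c_end <= cursor or c_start >= end:
                continue  # cut does not touch the remaining part of the span
            if c_start > cursor:
                result.append((cursor, c_start))
            cursor = max(cursor, c_end)
        if cursor < end:
            result.append((cursor, end))
    return result
```

With A = [(0, 4000), (9000, 12000)], B = [(3500, 7000)], and P = [(2000, 2500), (6500, 9500)], the call difference(union(A, B), P) yields [(0, 2000), (2500, 6500), (9500, 12000)].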

Some types of intervals may not have clear start and end times.
Instead of a binary on/off state at each time increment, some data has an
associated probability curve over time because the exact times of the events
are not certain. Output from automatic speech recognition (e.g., phoneme
lattices) can include several overlapping hypotheses about what words are
being said at any given moment. In one embodiment of the present
invention, IDB server 65 provides support for "fuzzy" intervals. In
another embodiment, IDB server 65 uses binary intervals along with a
probability value in the type-specific numeric data field to achieve an
effect similar to fuzzy intervals, but without fuzzy logical operations.
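
The second approach, binary intervals carrying a probability in the type-specific numeric field, could be used along the following lines. This is a hedged sketch: the record layout, the threshold filter, and the "most probable hypothesis" lookup are illustrative assumptions rather than behavior described for IDB server 65.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordHypothesis:
    start_ms: int
    end_ms: int
    word: str           # type-specific text value
    probability: float  # type-specific numeric value, 0.0 to 1.0


def confident(hyps: List[WordHypothesis], threshold: float) -> List[WordHypothesis]:
    """Keep hypotheses at or above the threshold, as ordinary binary intervals."""
    return [h for h in hyps if h.probability >= threshold]


def best_at(hyps: List[WordHypothesis], t_ms: int) -> Optional[WordHypothesis]:
    """Most probable hypothesis covering time t_ms, or None if none covers it."""
    covering = [h for h in hyps if h.start_ms <= t_ms < h.end_ms]
    return max(covering, key=lambda h: h.probability, default=None)
```

Filtering by threshold turns overlapping hypotheses into plain intervals that the ordinary set operations can handle, at the cost of discarding the uncertainty information.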

Transcriptions can be stored as interval data, perhaps one sentence
per interval, or one word per interval, depending on how fine a mapping is
desired between words and time. The transcriptions may be produced from
closed caption text, higher quality off-line transcriptions, or a lower
quality automatic speech recognition system.

Teleconference system 200 provides playback of recorded
conferences using conference playback documents. The system utilizes
stored labeled interval data associated with the conference. Fig. 3
illustrates a conference playback document 300 in accordance with one
embodiment of the present invention. Conference playback document 300
is implemented as a Java applet through Java user interface 85 of Fig. 1. It
uses a visual structuring of the recording as a series of color-coded
intervals (e.g., intervals 305 and 310) plotted on a horizontal time axis in
an area referred to as a time-line window 315. Each participant in a call
(e.g., participants 316-320) is allocated a separate time-line for graphically
depicting all labeled intervals that are associated with that person (e.g.,
dialing, connected, muted, talking, etc.).

Fig. 4 illustrates in detail how overlapping intervals are displayed.
As shown in Fig. 4, by plotting each interval type one at a time, starting
with taller bars, the document displays overlapping intervals on the same
line.
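
The plotting order described for Fig. 4 can be sketched as a sort by bar height. The heights assigned to each interval type below are hypothetical, since the source does not give values, and the function name is an assumption.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (type, start_ms, end_ms)

# Hypothetical bar heights in pixels; the source does not specify them.
BAR_HEIGHT: Dict[str, int] = {"connected": 12, "muted": 8, "talking": 4}


def draw_order(intervals: List[Interval]) -> List[Interval]:
    """Order intervals so that taller bars are painted first.

    Shorter bars painted later sit on top of taller ones, so overlapping
    intervals on one time-line all remain visible.
    """
    return sorted(intervals, key=lambda iv: (-BAR_HEIGHT[iv[0]], iv[1]))
```

Because each type has its own height, a short "talking" bar drawn over a tall "connected" bar leaves the taller bar visible above and below it.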
Referring again to Fig. 3, intervals that are not associated with an
individual person (e.g., hyperlinks 330, speech segments, etc.) are plotted
separately above the participants. Time-line window 315 provides a
snapshot of every participant's activity, and can be used to navigate
through the recording.

In one embodiment, once users have established a phone connection
to the recorded conference player, they can use a toolbar 350 below the
time-line to begin playing the audio and adjust the skimming parameters.
In another embodiment, a separate phone connection is not necessary
because the audio conference recording can be "streamed" in conjunction
with conference playback document 300.

Toolbar 350 provides five buttons to control the player: "goto
beginning" 351, "jump back" 352, "stop" 353, "play" 354, and "jump
forward" 355. It also contains a slider 356 for adjusting the playing speed
(0.7x, 1.0x, 1.3x, 1.7x, and 2.0x), a zoom menu 357 for selecting the
zoom factor (none, 20 min., 10 min., and 5 min.), and an on/off pause
button 358 for pause removal.

As the recorded conference audio plays, a vertical red needle 360
moves across the time-line. When needle 360 moves, every participant's
name tag is colored to reflect that person's state at that time in the meeting.
Fig. 3 shows a one hour conference with the entire duration visible (zoom
= none). In this view, the visual structures help make some details of the
call immediately obvious. For example, the number and span of the light
colored bars can identify the most/least dominant talkers. The initial long
uninterrupted talking bands show who gave the formal presentations.
Finally, the point where the question and answer session began is visible
roughly halfway into the call, where many short talking intervals are
scattered among many participants. More detailed information must be
found by either listening to the audio or by searching through linked
annotations, images, and other documents.
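
Determining each participant's state at the needle position amounts to a point lookup over the labeled intervals. The sketch below assumes a tuple layout and a precedence order among simultaneous states; neither is specified in the source.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, str, int, int]  # (participant, state, start_ms, end_ms)

# Hypothetical precedence when several states cover the same instant.
PRECEDENCE = ["talking", "muted", "connected", "dialing"]


def states_at(intervals: List[Interval], needle_ms: int) -> Dict[str, str]:
    """Map each participant to the highest-precedence state at needle_ms."""
    result: Dict[str, str] = {}
    for participant, state, start, end in intervals:
        if start <= needle_ms < end:
            current = result.get(participant)
            if current is None or PRECEDENCE.index(state) < PRECEDENCE.index(current):
                result[participant] = state
    return result
```

A user interface could call such a function each time the needle moves and recolor the name tags from the returned mapping.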

The zooming feature allows the user to narrow the duration
displayed in the time-line window. A numbered scroll bar allows the user
to register the zoomed-in portion with the full duration, and scroll using
mouse clicks or arrow keys on the keyboard. Scrolling is independent of
player location needle 360, so the user can separately glance at regions
without disrupting listening. Player needle 360 can be moved by clicking
on the time-line, or by pressing a jump forward/backward button. When
this happens, the skim server plays a short non-speech audio cue and
begins to play at the new location.

Clicking the time-line near the top is used to select hyperlinks (e.g.,
link 330) rather than to move the needle. When a link is selected, or a
"links" button 340 is pressed, a dialog displays all the links in the
recording. This dialog can be used to visit a link, edit a link, or create a
link both in and out of the time-line. One embodiment of the present
invention supports the following types of links: annotations, audio,
documents, images, and general URL. All links are implemented using
URLs except annotations, which store textual content as interval data.
Each type of link is displayed on the time-line with a representative icon.

Hyperlinks into and out of the time-line are stored as intervals, and
contain both a beginning and ending time offset. Thus a link can refer to a
particular point or region of the time-line, allowing a rich set of skimming
alternatives. For example, following a link can cause play to begin at a
certain point, end at a certain point, or sequence through selected regions.
This means that following a link can have multiple effects, including
moving the player needle and changing the document page.

As disclosed, one embodiment of the present invention is a
teleconference recorder and player. When a conference is recorded, an
interval database stores labeled interval data associated with the
conference. The labeled interval data allows searching and retrieving of
the recorded conference, and facilitates playback of the recorded
conference.

Several embodiments of the present invention are specifically
illustrated and/or described herein. However, it will be appreciated that
modifications and variations of the present invention are covered by the
above teachings and within the purview of the appended claims without
departing from the spirit and intended scope of the invention.

For example, although the embodiments disclosed are implemented
over the Internet, the present invention can be implemented using a private
network, or using any other known or future data communication methods.
WHAT IS CLAIMED IS:

1. A system for recording and playing multimedia data that
includes a plurality of intervals, said system comprising:
    a skim server that detects a first set of the plurality of intervals;
    an interval database server coupled to said skim server, said interval
database server generating labeled interval data for the first set of the
plurality of intervals detected by said skim server; and
    a database coupled to said interval database server and storing said
labeled interval data;
    wherein said labeled interval data comprises an interval data
element for each of the detected plurality of intervals.

2. The system of claim 1, further comprising:
    a conference bridge coupled to said interval database server that
detects a second set of the plurality of intervals;
    wherein said interval database server further generates labeled
interval data for the second set of the plurality of intervals detected by said
skim server.

3. The system of claim 2, wherein said first set of the plurality of
intervals comprise pauses in speech.

4. The system of claim 2, wherein said second set of the plurality
of intervals comprise call control events.

5. The system of claim 1, wherein the multimedia data comprises a
conference telephone call.

6. The system of claim 1, wherein said interval data element
comprises:
    a type of the detected interval;
    a start time of the detected interval; and
    a duration of the detected interval.

7. The system of claim 6, wherein said interval data element
further comprises:
    a recording identification of the detected interval; and
    a type-specific data value of the detected interval.

8. The system of claim 1, wherein said interval database server
comprises:
    means for searching said stored labeled interval data.

9. The system of claim 8, wherein said interval database server
further comprises:
    means for retrieving said stored labeled interval data and associated
multimedia data.

10. The system of claim 5, further comprising:
    a user interface generated during playback of the conference call,
wherein said user interface displays the stored labeled interval data.

11. A method for recording and playing multimedia data that
includes a plurality of intervals, said method comprising:
    detecting the plurality of intervals;
    generating labeled interval data for the plurality of intervals; and
    storing the labeled interval data in a database;
    wherein said labeled interval data comprises an interval data
element associated with each of the plurality of intervals.

12. The method of claim 11, wherein said interval data element
comprises:
    a type of the associated interval;
    a start time of the associated interval; and
    a duration of the associated interval.

13. The method of claim 12, wherein said interval data element
further comprises:
    a recording identification of the associated interval; and
    a type-specific data value of the associated interval.

14. The method of claim 11, further comprising:
    storing the multimedia data in the database.

15. The method of claim 14, further comprising:
    querying said database based on one or more labeled interval data
parameters; and
    retrieving at least one interval data element and associated
multimedia data from the database.

16. The method of claim 11, wherein the multimedia data
comprises a conference telephone call.

17. The method of claim 16, further comprising:
    generating a user interface that displays the labeled interval data;
and
    playing the conference call based on selections of the user interface.

18. A method of recording and playing a teleconference telephone
call, said method comprising:
    detecting a plurality of intervals during the telephone call;
    generating labeled interval data for each of said plurality of
intervals; and
    storing said labeled interval data in a database.

19. The method of claim 18, wherein said labeled interval data
comprises a plurality of interval data elements, said method further
comprising:
    querying said database and retrieving one or more of the stored
interval data elements; and
    playing a portion of the teleconference telephone call that is
associated with each of said retrieved interval data elements.

20. The method of claim 18, wherein said detected intervals
comprise:
    an identity of a speaker;
    pauses in speech; and
    telephone call control.